ABSTRACT

The problem of finding locally dense components of a graph 
is an important primitive in data analysis, with wide-ranging 
applications from community mining to spam detection and 
the discovery of biological network modules. In this paper 
we present new algorithms for finding the densest subgraph 
in the streaming model. For any $\epsilon > 0$, our algorithms make
$O(\log_{1+\epsilon} n)$ passes over the input and find a subgraph whose
density is guaranteed to be within a factor $2(1 + \epsilon)$ of the
optimum. Our algorithms are also easily parallelizable and 
we illustrate this by realizing them in the MapReduce model. 
In addition we perform extensive experimental evaluation 
on massive real-world graphs showing the performance and 
scalability of our algorithms in practice. 

1. INTRODUCTION 

Large-scale graph processing remains a challenging prob- 
lem in data analysis. In this work we focus on the densest 
subgraph problem that forms a basic primitive for a diverse 
number of applications ranging from those in computational 
biology [36] to community mining [8, 17] and spam detec- 
tion [21]. We present algorithms that work both in the data 
streaming and distributed computing models for large scale 
data analysis and are efficient enough to generalize to graphs 
with billions of nodes and tens of billions of edges. 

As input to the densest subgraph problem, we are given a
graph $G = (V, E)$ and are asked to find a subset $S$ of nodes
that maximizes the ratio of the number of edges with both
endpoints in $S$ to the number of nodes in $S$. This basic
problem can take on several flavors. The graph may be
undirected (e.g. friendships in Facebook) or directed (e.g.
followers in Twitter). In the latter case, the goal is to
select two subsets $S$ and $T$ maximizing the number of edges
from $S$ to $T$ normalized by the geometric mean of $|S|$ and
$|T|$. A different line of work insists that the subgraphs be
large: the input is augmented with an integer $k$ with the
requirement that the output subset has at least $k$ nodes.

This simple problem has a variety of applications across 
different areas. We illustrate some examples below. 

(1) Community mining. One of the most natural appli- 
cations of the densest subgraph problem is finding structure 
in large networks. The densest subgraph problem is useful 
in identifying communities [12, 17, 32], which can then be 
leveraged to obtain better graph compression [8]. Heuris- 
tics, with no provable performance guarantees, have been 
typically used in this line of work. 

(2) Computational biology. Saha et al. [36] adapt the dens- 
est subgraph problem for finding complex patterns in the 
gene annotation graph, using approximation and flow-based 
exact algorithms. They validate this approach by show- 
ing that some of the patterns automatically discovered had 
been previously studied in the literature; for more examples, 
see [1, Chapter 18]. 

(3) Link spam detection. Gibson et al. [21] observe that 
dense subgraphs on the web often correspond to link spam, 
hence their detection presents a useful feature for search 
engine ranking; they use a heuristic method that works well 
in the data stream paradigm. 

(4) Reachability and distance query indexing. Algorithms 
for the densest subgraph problem form a crucial primitive in 
the construction of efficient indexes for reachability and dis- 
tance queries, most notably in the well-known 2-hop label- 
ing, first introduced in [14] , as well as the more recent 3-hop 
indexing [23] . To underscore the importance of practical al- 
gorithms the authors of [14] remark that the 2-approximation 
algorithm of [10] is of more practical interest than the more 
complex but exact algorithm. 

In all these applications, a good approximation to the 
densest subgraph is sufficient and is certainly more desirable 
than a heuristic without any performance guarantees. 

It is known that both the directed and the undirected 
version of the densest subgraph problem can be solved op- 
timally using parametric flow [29] or linear programming 
relaxation [10]. In the same work, Charikar [10] gave simple 
combinatorial approximation algorithms for this problem. 
On a high level, his algorithm for the undirected case greed- 
ily removes the worst node from the graph in every pass; the 
analysis shows that one of the intermediate graphs is a 2- 
approximation to the densest subgraph problem. The basic 
version of the problem provides no control over the size of 
the densest subgraph. But, if one insists on finding a large 
dense subgraph containing at least $k$ nodes, the problem becomes
NP-hard [26]. Andersen and Chellapilla [3] as well as
Khuller and Saha [26] show how to obtain 2-approximations
for this version of the problem. 

While the algorithms proposed in the above line of work 
guarantee good approximation factors, they are not efficient 
when run on very large datasets. In this work we show 
how to use the principles underlying existing algorithms, 
especially [10], to develop new algorithms that can be run 
in the data stream and distributed computing models, for 
example, MapReduce; this also resolves the open problem 
posed in [1, 13]. 

1.1 Streaming and MapReduce 

As the datasets have grown to tera- and petabyte input 
sizes, two paradigms have emerged for developing algorithms 
that scale to such large inputs: streaming and MapReduce. 

In the streaming model [34], one assumes that the input 
can be read sequentially in a number of passes over the data, 
while the total amount of random access memory (RAM) 
available to the computation is sublinear in the size of the 
input. The goal is to reduce the number of passes needed, all 
the while minimizing the amount of RAM necessary to store 
intermediate results. In the case the input is a graph, the 
nodes V are known in advance, and the edges are streamed 
(it is known that most non-trivial graph problems require
$\Omega(|V|)$ RAM, even if multiple passes can be used [18]). The
challenge in streaming algorithms lies in wisely using the 
limited amount of information that can be stored between 
passes. 

Complementing streaming algorithms, MapReduce, and 
its open source implementation, Hadoop, has become the de- 
facto model for distributed computation on a massive scale. 
Unlike streaming, where a single machine eventually sees the 
whole dataset, in MapReduce, the input is partitioned across 
a set of machines, each of which can perform a series of 
computations on its local slice of the data. The process can 
then be repeated, yielding a multi-pass algorithm (See [16] 
for exact framework, and [19, 25] for theoretical models). 
It is well known that simple operations like sum and other 
holistic measures [35] as well as some graph primitives, like 
finding connected components [25], can be implemented in 
MapReduce in a work-efficient manner. The challenge lies 
in reducing the total number of passes with no machine ever 
seeing the entire dataset. 

1.2 Our contributions 

In this work we focus on obtaining efficient algorithms 
for the densest subgraph problem that can work on mas- 
sive graphs, where the graph cannot be stored in the main 
memory. 

Specifically, we show how to modify the approach of [10]
so that the resulting algorithm makes only $O(\frac{1}{\epsilon} \log n)$ passes
over the data and is guaranteed to return an answer within a
$(2 + \epsilon)$ factor of optimum. We show that our algorithm only
requires the computation of basic graph parameters (e.g., 
the degree of each node and the overall density) and thus 
can be easily parallelized — we use the MapReduce model 
to demonstrate one such parallel implementation. Finally, 
we show that despite the $(2 + \epsilon)$ worst-case approximation
guarantee, the algorithm's output is often nearly optimal 
on real-world graphs; moreover it can easily scale to graphs 
with billions of edges. 



2. RELATED WORK 

The densest subgraph problem lies at the core of large 
scale data mining and as such it and its variants have been 
intensively studied. Goldberg [22] was one of the first to 
formally introduce the problem of finding the densest sub- 
graph in an undirected graph and gave an algorithm that 
required $O(\log n)$ flow computations to find the optimal
solution; see also [29]. Charikar [10] described a simple greedy
algorithm and showed that it leads to a 2-approximation 
to the optimum. When augmented with a constraint re- 
quiring the solution be of size at least k, the problem be- 
comes NP-hard [26]. On the positive side, Andersen and 
Chellapilla [3] gave a 2-approximation to this version of the 
problem, and [26] gave a faster algorithm that achieves the 
same solution quality. 

In the case the underlying graph is directed, Kannan and 
Vinay [24] were the first to define the notion of density and 
gave an $O(\log n)$-approximation algorithm. This was further
improved by Charikar [10] who showed that it can be
solved exactly in polynomial time by solving $O(n^2)$ linear
programs, and obtained a combinatorial 2-approximation
algorithm. The latter algorithm was simplified in the work of
Khuller and Saha [26]. 

In addition to the steady theoretical progress, there is a 
rich line of work that tailored the problem to the specific task 
at hand. Variants of densest subgraph problem have been 
used in computational biology (see, for example [1, Chapter 
14]), community mining [12, 17, 32], and even to decide 
what subset of people would form the most effective working 
group [20] . The specific problem of finding dense subgraphs 
on very large datasets was addressed in Gibson et al. [21] 
who eschewed approximation guarantees and used shingling 
approaches to find sets of nodes with high neighborhood 
overlap. 

Streaming and MapReduce. Data streaming and MapRe- 
duce have emerged as two leading paradigms for handling 
computation on very large datasets. In the data stream 
model, the input is assumed too large to fit into main mem- 
ory, and is instead streamed past one object at a time. For 
an introduction to streaming, see the excellent survey by 
Muthukrishnan [34]. When streaming graphs, the typical 
assumption is that the set of nodes is known ahead of time 
and can fit into main memory, and the edges arrive one by 
one; this is the semi-streaming model of computation [18]. 
Algorithms for a variety of graph primitives from match- 
ings [31], to counting triangles [5, 6] have been proposed 
and analyzed in this setting. 

While data streams are an efficient model of computa- 
tion for a single machine, MapReduce has become a pop- 
ular method for large-scale parallel processing. Beginning 
with the original work of Dean and Ghemawat [16], several 
algorithms have been proposed for distributed data analy- 
sis, from clustering [15] to solving set cover [13]. For graph 
problems, Karloff et al. [25] give algorithms for finding con- 
nected components and spanning trees; Suri and Vassilvit- 
skii show how to count triangles effectively [37], while Lat- 
tanzi et al. [28] and Morales et al. [33] describe algorithms 
for finding matchings on massive graphs. 

3. PRELIMINARIES 

Let $G = (V, E)$ be an undirected graph. For a subset
$S \subseteq V$, let the induced edge set be defined as
$E(S) = E \cap S^2$ and let the induced degree of a node
$i \in S$ be defined as
$\deg_S(i) = |\{j \mid (i,j) \in E(S)\}|$.

The following notion of graph density is classical (see, for 
example, [29, Chapter 4]). 

Definition 1 (Density, undirected). Let $G = (V, E)$
be an undirected graph. Given $S \subseteq V$, its density $\rho(S)$ is
defined as

$$\rho(S) = \frac{|E(S)|}{|S|}.$$

The maximum density $\rho^*(G)$ of the graph is then
$\rho^*(G) = \max_{S \subseteq V} \{\rho(S)\}$.
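
To make Definition 1 concrete, the following minimal sketch computes $\rho(S)$ for an edge list and finds $\rho^*(G)$ by exhaustive search. The function names and the toy graph are illustrative assumptions, and the exhaustive search is exponential in $|V|$, so it is only meant for checking small examples.

```python
from itertools import combinations


def induced_edges(edges, nodes):
    """Return E(S): the edges with both endpoints in `nodes`."""
    s = set(nodes)
    return [(u, v) for (u, v) in edges if u in s and v in s]


def density(edges, nodes):
    """rho(S) = |E(S)| / |S|; taken to be 0.0 for the empty set."""
    s = set(nodes)
    if not s:
        return 0.0
    return len(induced_edges(edges, s)) / len(s)


def max_density_bruteforce(vertices, edges):
    """rho*(G) by enumerating every non-empty subset (exponential time)."""
    best_set, best = set(), 0.0
    for r in range(1, len(vertices) + 1):
        for subset in combinations(vertices, r):
            d = density(edges, subset)
            if d > best:
                best_set, best = set(subset), d
    return best_set, best


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
V = [0, 1, 2, 3, 4, 5]
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(density(E, V))
print(max_density_bruteforce(V, E))
```

On this toy graph the whole graph has density 8/6, while the 4-clique has density 6/4 = 1.5, which is the maximum.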

In case the graph is weighted, the density incorporates the
total weight of all of the edges in the induced subgraph.

We also define density above a size threshold: given $k > 0$
and an undirected graph $G$, we define

$$\rho_{\geq k}(G) = \max_{S \subseteq V,\, |S| \geq k} \rho(S).$$



For directed graphs, the density is defined as follows [24].
Let $G = (V, E)$ be a directed graph. For $S, T \subseteq V$, where the
subsets are not necessarily disjoint, let $E(S, T) = E \cap (S \times T)$.
We abbreviate $E(\{i\}, T)$ as $E(i, T)$ and $E(S, \{j\})$ as $E(S, j)$.

Definition 2 (Density, directed). Let $G = (V, E)$
be a directed graph. Given $S, T \subseteq V$, their density $\rho(S, T)$
is defined as

$$\rho(S, T) = \frac{|E(S, T)|}{\sqrt{|S|\,|T|}}.$$

The maximum density $\rho^*(G)$ of the graph is then
$\rho^*(G) = \max_{S, T \subseteq V} \{\rho(S, T)\}$.
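
As a small illustration of Definition 2, the sketch below evaluates $\rho(S, T)$ on a directed edge list. The function name and the example graph are assumptions made for illustration only.

```python
import math


def directed_density(edges, S, T):
    """rho(S, T) = |E(S, T)| / sqrt(|S| * |T|) for directed edges (u, v)."""
    S, T = set(S), set(T)
    if not S or not T:
        return 0.0
    e_st = sum(1 for (u, v) in edges if u in S and v in T)
    return e_st / math.sqrt(len(S) * len(T))


# Hypothetical toy graph: nodes 0 and 1 each point to nodes 2, 3 and 4.
E = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]
print(directed_density(E, {0, 1}, {2, 3, 4}))   # 6 edges over sqrt(2 * 3)
print(directed_density(E, range(5), range(5)))  # 6 edges over sqrt(5 * 5)
```

Choosing $S$ as the sources and $T$ as the targets gives a higher density than taking $S = T = V$, which is why the directed problem optimizes over two sets.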



Approximation. For $\alpha \geq 1$, an algorithm is said to obtain
an $\alpha$-approximation to the undirected densest subgraph
problem if it outputs a subset $S \subseteq V$ such that
$\rho(S) \geq \rho^*(G)/\alpha$. An analogous definition can be made
for the directed case.

4. ALGORITHMS

In this section we present streaming algorithms for finding
approximately densest subgraphs. For any $\epsilon > 0$, we
obtain a $(2 + 2\epsilon)$-approximation algorithm for the case of
both undirected graphs (Section 4.1) and directed graphs
(Section 4.3) and a $(3 + 3\epsilon)$-approximation algorithm when
the densest subgraph is prescribed to be more than a certain
size (Section 4.2). All of our algorithms make $O(\log n)$
passes over the input graph and use $O(n)$ main memory.

Our algorithms are motivated by Charikar's greedy algorithm
for the densest subgraph problem [10] and the MapReduce
algorithm for maximum coverage [7, 13]. They work
by carefully relaxing the greedy constraint in a way that
almost preserves the approximation factor, yet exponentially
decreases the number of passes. We also show a lower bound
on the space required by any streaming algorithm to obtain
a constant-factor approximation.

4.1 Undirected graphs

In this section we present the greedy approximation algorithm
for undirected graphs. Let $G = (V, E)$ be an undirected
graph and let $\epsilon > 0$. The algorithm proceeds in
passes, in every pass removing a constant fraction of the
remaining nodes. We show that one of the intermediate
subgraphs forms a $(2 + 2\epsilon)$-approximation to the densest
subgraph. We note that the densest subgraph problem in
undirected graphs can be solved exactly in polynomial time
via flows or linear programming (LPs); however, flow and
LP techniques scale poorly to internet-sized graphs. We will
show in Section 6 that despite worst-case examples, the
algorithms we give yield near-optimal solutions on real-world
graphs and are much simpler and more efficient than the
flow/LP-based algorithms.

Starting with the given graph $G$, the algorithm computes
the current density, $\rho(G)$, and removes all of the nodes (and
their incident edges) whose degree is at most $(2 + 2\epsilon) \cdot \rho(G)$.
If the resulting graph is non-empty, then the algorithm
recurses on the remaining graph, with node set denoted by $S$,
again computing its density and removing all of the nodes
whose degree does not exceed the specified threshold; we
denote these nodes by $A(S)$. Then, the node set reduces to
$S \setminus A(S)$, and the recursion continues in the same way.
Algorithm 1 presents the complete description.

Algorithm 1 Densest subgraph for undirected graphs.
Require: $G = (V, E)$ and $\epsilon > 0$
  $S, \tilde{S} \leftarrow V$
  while $S \neq \emptyset$ do
    $A(S) \leftarrow \{i \in S \mid \deg_S(i) \leq 2(1 + \epsilon)\rho(S)\}$
    $S \leftarrow S \setminus A(S)$
    if $\rho(S) > \rho(\tilde{S})$ then
      $\tilde{S} \leftarrow S$
    end if
  end while
  return $\tilde{S}$




Clearly, this algorithm can be implemented in a streaming 
fashion using only $O(n)$ memory since we only need to store
and update the current node degrees to compute the density 
and to decide which nodes to remove. We now analyze the 
approximation factor of the algorithm and its running time. 
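
The following is a minimal Python sketch of Algorithm 1 under a few assumptions that are not part of the original description: the stream is modeled as a zero-argument callable `edge_stream` that yields the edges afresh on every call (one call per pass), the graph is simple, and the names `densest_subgraph`, `best_set`, and `passes` are illustrative. Only the node set and one counter per remaining node are kept between passes.

```python
def densest_subgraph(vertices, edge_stream, eps):
    """Sketch of Algorithm 1.

    `edge_stream` is a zero-argument callable returning a fresh iterator
    over undirected edges (u, v); each call models one pass over the input.
    Returns (best node set, its density, number of passes made).
    """
    S = set(vertices)
    best_set, best_density = set(S), None
    passes = 0
    while S:
        # One pass: induced degrees and induced edge count of the current S.
        deg = dict.fromkeys(S, 0)
        m = 0
        for u, v in edge_stream():
            if u in deg and v in deg:
                deg[u] += 1
                deg[v] += 1
                m += 1
        passes += 1
        rho = m / len(S)
        if best_density is None or rho > best_density:
            best_set, best_density = set(S), rho
        # Drop A(S): nodes whose degree is at most 2(1 + eps) * rho(S).
        threshold = 2 * (1 + eps) * rho
        S = {i for i in S if deg[i] > threshold}
    return best_set, best_density, passes


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(densest_subgraph(range(6), lambda: iter(E), eps=0.1))
```

The sketch evaluates the density of each set at the start of the pass that processes it, which visits the same candidate sets as the pseudocode. On the toy graph the first pass removes the two path nodes and the second pass sees the 4-clique.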

Lemma 3. Algorithm 1 obtains a $(2 + 2\epsilon)$-approximation
to the densest subgraph problem.

Proof. As the algorithm proceeds, the density of the
remaining graph is non-monotonic, a fact that we observe
experimentally in Section 6. We will show, however, that one
of the intermediate subgraphs is a $(2 + 2\epsilon)$-approximation to
the optimal solution.

To proceed, fix some optimal solution $S^*$, s.t. $\rho(S^*) =
\rho^*(G)$. First, we note that for each $i \in S^*$, $\deg_{S^*}(i) \geq \rho(S^*)$:
indeed, by the optimality of $S^*$, for any $i \in S^*$, we have

$$\frac{|E(S^*)|}{|S^*|} = \rho(S^*) \geq \rho(S^* \setminus \{i\}) = \frac{|E(S^*)| - \deg_{S^*}(i)}{|S^*| - 1}. \qquad (4.1)$$

Since $\sum_{i \in S} \deg_S(i) = 2|S|\rho(S)$, the average degree in $S$
is $2\rho(S)$, so at least one node must be removed in every pass.
Now, consider the first pass in which a node $i$ from the
optimal solution $S^*$ is removed, i.e. $A(S) \cap S^* \neq \emptyset$;
this moment is guaranteed to exist, since $S$ eventually
becomes empty. Clearly, $S \supseteq S^*$.
Let $i \in A(S) \cap S^*$. We have

$$\begin{aligned}
\rho(S^*) &\leq \deg_{S^*}(i) && \because (4.1) \\
&\leq \deg_S(i) && \because S \supseteq S^* \\
&\leq (2 + 2\epsilon)\rho(S). && \because i \in A(S)
\end{aligned}$$

This implies $\rho(S) \geq \rho(S^*)/(2 + 2\epsilon)$ and hence the algorithm
outputs a $(2 + 2\epsilon)$-approximation. □

Next, we show that the algorithm removes a constant fraction
of all of the nodes in every pass, and thus is guaranteed
to terminate after $O(\log n)$ passes of the while loop.

Lemma 4. Algorithm 1 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. At each step of the pass, we have

$$\begin{aligned}
2|E(S)| &= \sum_{i \in A(S)} \deg_S(i) + \sum_{i \in S \setminus A(S)} \deg_S(i) \\
&> 2(1 + \epsilon)(|S| - |A(S)|)\rho(S) \\
&= 2(1 + \epsilon)(|S| - |A(S)|)\frac{|E(S)|}{|S|},
\end{aligned}$$

where the inequality follows by considering only those
nodes in $S \setminus A(S)$. Thus,

$$|A(S)| > \frac{\epsilon}{1 + \epsilon}|S|. \qquad (4.2)$$

Equivalently,

$$|S \setminus A(S)| < \frac{1}{1 + \epsilon}|S|.$$

Therefore, the cardinality of the remaining set $S$ shrinks
by a factor of at least $1 + \epsilon$ during each pass. Hence, the
algorithm terminates in $O(\log_{1+\epsilon} n)$ passes. □

Notice that for small $\epsilon$, $\log(1 + \epsilon) \approx \epsilon$ and hence the number
of passes is $O(\frac{1}{\epsilon} \log n)$.
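
As a quick numerical illustration of the bound, the sketch below evaluates $\lceil \log_{1+\epsilon} n \rceil$, after which fewer than $n/(1+\epsilon)^t \leq 1$ nodes can remain by (4.2), and compares it with the small-$\epsilon$ estimate $\frac{1}{\epsilon} \ln n$. The function names are illustrative, and the constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math


def pass_bound(n, eps):
    """Smallest integer t with n / (1 + eps)**t <= 1 (up to float rounding)."""
    return math.ceil(math.log(n) / math.log(1 + eps))


def small_eps_estimate(n, eps):
    """The approximation (1 / eps) * ln(n), meaningful for small eps."""
    return math.log(n) / eps


for eps in (1.0, 0.5, 0.1):
    print(eps, pass_bound(10**6, eps), round(small_eps_estimate(10**6, eps), 1))
```

For a million nodes the bound grows from a few dozen passes at $\epsilon = 1$ to well over a hundred at $\epsilon = 0.1$, which is the price of the tighter approximation factor.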

4.1.1 Lower bounds 

In this section we show that our analysis is tight. In
particular, we show that there are graphs on which Algorithm
1 makes $\Omega(\log n)$ passes. Furthermore, we also show that
any algorithm that achieves a 2-approximation in $O(\log n)$
passes must use $\Omega(n/\log n)$ space. Note that Algorithm
1 comes close to this lower bound since it makes $O(\log n)$
passes and uses $O(n)$ memory.

Pass lower bound. We show that the analysis of the num- 
ber of passes is tight up to constant factors. We begin with 
a slightly weaker result. 

Lemma 5. There exists an unweighted graph on which
Algorithm 1 requires $\Omega\left(\frac{\log n}{\log \log n}\right)$ passes.

Proof. The graph consists of $k$ disjoint subsets $G_1, \ldots,
G_k$, where $G_i$ is a $2^{i-1}$-regular graph on $|V_i| = 2^{2k+1-i}$
nodes, hence every $G_i$ has exactly $2^{2k-1}$ edges and has
density of $2^{i-2}$. For any $\ell \geq 1$, let $G_{\geq \ell} = \bigcup_{i \geq \ell} G_i$.
The density of $G_{\geq \ell}$ is:

$$\rho(G_{\geq \ell}) = \frac{(k - \ell + 1)\, 2^{2k-1}}{2^{k+1}(2^{k-\ell+1} - 1)} \approx (k - \ell + 1)\, 2^{\ell-3}.$$

We claim that in every pass the algorithm removes $O(\log k)$
of these subgraphs. Suppose that we start with the subgraph
$G_{\geq \ell}$ at the beginning of the pass. Then the nodes in $A(S)$
are exactly those that have their degree less than
$\rho(G_{\geq \ell})(2 + \epsilon) \approx (k - \ell + 1)\, 2^{\ell-2}$. Since a node in $G_i$ has
degree $2^{i-1}$, this roughly corresponds to the nodes in $G_i$ for
$i < (\ell - 1) + \log(k - \ell)$, and hence the subgraph in the next
pass is $G_{\geq \ell + \log(k - \ell) - 1}$.

Thus, the algorithm will take at least $\Omega(k / \log k)$ passes
to complete. Since $k = \Theta(\log n)$, the proof follows. □
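
The construction in Lemma 5 can be explored numerically. Because every block $G_i$ is regular and the blocks are disjoint, Algorithm 1 removes whole blocks at a time, so the sketch below tracks blocks rather than individual nodes. The function name `lemma5_passes` is an assumption, and the simulation uses the removal rule of Algorithm 1 (degree at most $2(1+\epsilon)\rho(S)$) rather than the approximations used in the proof.

```python
def lemma5_passes(k, eps):
    """Simulate Algorithm 1 on the Lemma 5 graph at block granularity.

    Block i (1 <= i <= k) is 2**(i-1)-regular on 2**(2k+1-i) nodes,
    so it has 2**(2k-1) edges.
    """
    nodes = {i: 2 ** (2 * k + 1 - i) for i in range(1, k + 1)}
    degree = {i: 2 ** (i - 1) for i in range(1, k + 1)}
    edges = {i: nodes[i] * degree[i] // 2 for i in nodes}
    alive = set(nodes)
    passes = 0
    while alive:
        rho = sum(edges[i] for i in alive) / sum(nodes[i] for i in alive)
        # A block survives only if its (uniform) degree exceeds the threshold.
        alive = {i for i in alive if degree[i] > 2 * (1 + eps) * rho}
        passes += 1
    return passes


for k in (4, 8, 16):
    print(k, lemma5_passes(k, eps=0.1))
```

The number of passes grows with $k$, while the number of nodes is roughly $2^{2k}$, which matches the $k = \Theta(\log n)$ relationship used at the end of the proof.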

To show an example on which Algorithm 1 needs $\Omega(\log n)$
passes, we appeal to weighted graphs. Note that Algorithm
1 and the analysis easily generalize to finding the maximum
density subgraph in an undirected weighted graph.

Lemma 6. There exists a weighted graph on which
Algorithm 1 requires $\Omega(\log n)$ passes.

Proof Sketch. Consider a graph whose degree sequence
follows a power law with exponent $0 < \alpha < 1$, i.e., if $d_i$ is
the $i$th largest degree, then $d_i \propto i^{-\alpha}$. We have
$\sum_{i=1}^{n} i^{-\alpha} \approx \int_1^n x^{-\alpha}\,dx \approx \frac{n^{1-\alpha}}{1-\alpha}$,
so if the graph has $m$ edges, we (approximately) have
$d_i = \frac{(1-\alpha)\, i^{-\alpha}\, m}{n^{1-\alpha}}$. Hence, in the first pass of
the algorithm, we remove all the nodes with

$$d_i = \frac{(1-\alpha)\, i^{-\alpha}\, m}{n^{1-\alpha}} < (2 + \epsilon)\frac{m}{n}.$$

Thus, the nodes such that

$$i < \left(\frac{1-\alpha}{2+\epsilon}\right)^{1/\alpha} n$$

go to the next pass; note that this is a constant fraction of
the nodes. As long as the power law property of the degree
sequence is preserved after removing the low degree nodes
in each pass, we obtain the desired $\Omega(\log n)$ lower bound.

Consider the graphs generated by the preferential attach- 
ment process [2]. To avoid the stochasticity in the model, 
which only makes the analysis more complicated, one can 
consider the following deterministic variant of this process: 
whenever a new node $u$ arrives, it adds an edge to all of
the existing nodes $v$ and assigns a weight $w_{u,v}$ to the edge
$(u, v)$ which is proportional to the current degree of $v$. Then
the degree of the $i$th node after a total of $n$ nodes have arrived
follows a power law distribution, which is exactly what we
needed to achieve. □



Space lower bound. We show that the trade-off between
memory and number of passes is almost the best possible.
Namely, any constant-pass streaming algorithm for approximating
the densest subgraph to within a constant factor of
2 must use a linear amount of memory, and an algorithm
making $O(\log n)$ passes must use $\Omega(n/\log n)$ memory.

Lemma 7. Any $p$-pass streaming $\alpha$-approximation algorithm
for the densest subgraph problem, where $\alpha \geq 2$, needs
$\Omega(n/(p\alpha^2))$ space.

Proof. Consider the standard disjointness problem in
the $q$-party arbitrary round communication model. There
are $q \geq 2$ players, and the $j$th player has the $n$-bit vector
$x_{j1}, \ldots, x_{jn}$. Their goal is to decide if there is an index $i$
such that $\bigwedge_{j=1}^{q} x_{ji} = 1$. It is known that this problem needs
$\Omega(n/q)$ communication [4, 9] and the lower bound holds even
under the promise that either the bit vectors are pairwise
disjoint (NO instance) or they have a unique common element
but are otherwise disjoint (YES instance).

Given such an instance of disjointness, we construct the
following densest subgraph instance. The overall graph $G =
(V, E)$ consists of $n$ disjoint subgraphs $G_1, \ldots, G_n$. Each
graph $G_i = (V_i, E_i)$ has $q$ nodes, $V_i = \{u_{i,1}, \ldots, u_{i,q}\}$. For
each $i$, if $x_{ji} = 1$, then the $j$th player adds the $q - 1$ edges
$\{(u_{i,j}, u_{i,j'}) \mid j' \in [q],\, j' \neq j\}$ to $E_i$.

It is easy to see that, given the promise of a pairwise
disjoint instance, each non-empty graph $G_i$ is a star in the
case of a NO instance. In the case of a YES instance, one of
the $G_i$'s is a complete graph. Therefore, $\rho^*(G) = q - 1$ if and
only if it is a YES instance and $\rho^*(G) = 1 - 1/q$ otherwise.

By setting $\alpha = q$ and using a standard reduction from
streaming to communication we can conclude that if a $p$-pass
streaming algorithm uses $o(n/(\alpha^2 p))$ memory and obtains
an $\alpha$-approximation to the densest subgraph, then it can be
used on $G$ to decide the disjointness instance, using $o(n)$
communication. Given the communication lower bound for
disjointness, the space lower bound for the densest subgraph
follows. □
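
The reduction can be checked on tiny instances. The sketch below builds the blocks $G_i$ from the players' bit vectors and computes the maximum density by brute force inside each block; this suffices because the blocks are disjoint, so the density of any union of pieces is a weighted average of the pieces' densities. The names `build_instance` and `max_density`, the node labels `(i, j)`, and the choice to keep parallel edges (each player adds its own copy of an edge) are assumptions of this sketch.

```python
from itertools import combinations


def build_instance(x):
    """x[j][i] is bit i of player j. Returns one edge list per block G_i.

    Node (i, j) stands for u_{i,j}; parallel edges are kept.
    """
    q, n = len(x), len(x[0])
    blocks = []
    for i in range(n):
        edges = []
        for j in range(q):
            if x[j][i] == 1:
                edges += [((i, j), (i, jp)) for jp in range(q) if jp != j]
        blocks.append(edges)
    return blocks


def max_density(blocks, q):
    """Maximum density over node subsets of single blocks (brute force)."""
    best = 0.0
    for i, edges in enumerate(blocks):
        for r in range(1, q + 1):
            for js in combinations(range(q), r):
                s = {(i, j) for j in js}
                inside = sum(1 for (u, v) in edges if u in s and v in s)
                best = max(best, inside / r)
    return best


no_instance = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]   # pairwise disjoint
yes_instance = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]  # common index 3
print(max_density(build_instance(no_instance), 3))
print(max_density(build_instance(yes_instance), 3))
```

With $q = 3$ the NO instance yields stars of density $1 - 1/q = 2/3$, while the YES instance contains a block of density $q - 1 = 2$, so the two cases differ by a factor of $q$.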

4.2 Large dense subgraphs 

In this section we show that a small modification of the 
algorithm presented in Section 4.1 gives a good approxima- 
tion to the problem of finding densest subgraphs above a 
prescribed size, k. 

The main difference from Algorithm 1 comes from the
fact that instead of removing all of the nodes with degree
at most $2(1 + \epsilon)\rho(S)$, we only remove
$\frac{\epsilon}{1+\epsilon}|S|$ of them. Intuitively, by removing the
smallest number of nodes necessary
to guarantee the fast convergence of the algorithm, we make
sure that at least one of the graphs under consideration has
approximately $k$ nodes. Algorithm 2 contains the complete
description.

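Algorithm 2 Large densest subgraphs.
Require: $G = (V, E)$, $k > 0$, and $\epsilon > 0$
  $S, \tilde{S} \leftarrow V$
  while $S \neq \emptyset$ do
    $A(S) \leftarrow \{i \in S \mid \deg_S(i) \leq 2(1 + \epsilon)\rho(S)\}$
    Let $\tilde{A}(S) \subseteq A(S)$, with $|\tilde{A}(S)| = \frac{\epsilon}{1+\epsilon}|S|$
    $S \leftarrow S \setminus \tilde{A}(S)$
    if $|S| \geq k$ and $\rho(S) > \rho(\tilde{S})$ then
      $\tilde{S} \leftarrow S$
    end if
  end while
  return $\tilde{S}$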
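
A minimal in-memory sketch of Algorithm 2 follows. Several details are assumptions rather than part of the algorithm as stated: the quota $\frac{\epsilon}{1+\epsilon}|S|$ is rounded up, the removed subset $\tilde{A}(S)$ is taken to be the lowest-degree nodes of $A(S)$ with ties broken by node label, the loop stops as soon as fewer than $k$ nodes remain, and the edge list is re-scanned once per pass in place of a stream.

```python
import math


def large_densest_subgraph(vertices, edges, k, eps):
    """Sketch of Algorithm 2; returns (best set with >= k nodes, density).

    Returns (None, -1.0) if the graph has fewer than k nodes.
    """
    S = set(vertices)
    best_set, best_density = None, -1.0
    while S and len(S) >= k:
        deg = dict.fromkeys(S, 0)
        m = 0
        for u, v in edges:
            if u in deg and v in deg:
                deg[u] += 1
                deg[v] += 1
                m += 1
        rho = m / len(S)
        if rho > best_density:
            best_set, best_density = set(S), rho
        # A(S), ordered so that the lowest-degree nodes are removed first.
        candidates = sorted(
            (i for i in S if deg[i] <= 2 * (1 + eps) * rho),
            key=lambda i: (deg[i], i),
        )
        quota = math.ceil(eps / (1 + eps) * len(S))
        S -= set(candidates[:quota])
    return best_set, best_density


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(large_densest_subgraph(range(6), E, k=4, eps=0.25))
```

By (4.2), $A(S)$ always contains more than $\frac{\epsilon}{1+\epsilon}|S|$ nodes, so the rounded-up quota can always be filled from `candidates`.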



To prove the approximation ratio of the algorithm, we use 
the following notation from [3] . 

Definition 8. The $d$-core of $G$, denoted $C_d(G)$, is the
largest induced subgraph of $G$ with all degrees larger than or
equal to $d$.
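
The $d$-core of Definition 8 can be computed by repeatedly deleting nodes of induced degree below $d$, a standard peeling procedure. The sketch below is an illustration with an assumed function name; it re-scans the edge list after every round rather than maintaining degrees incrementally.

```python
def d_core(vertices, edges, d):
    """Return the node set of C_d(G) by peeling low-degree nodes."""
    core = set(vertices)
    changed = True
    while changed:
        deg = dict.fromkeys(core, 0)
        for u, v in edges:
            if u in deg and v in deg:
                deg[u] += 1
                deg[v] += 1
        low = {i for i in core if deg[i] < d}
        core -= low
        changed = bool(low)
    return core


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
V = [0, 1, 2, 3, 4, 5]
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(d_core(V, E, 3))
```

On the toy graph the 3-core is the 4-clique, and the 4-core is empty because no induced subgraph has all degrees at least 4.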

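Theorem 9. Algorithm 2 is a $(3 + 3\epsilon)$-approximation
algorithm for the problem of finding $\rho_{\geq k}(G)$.

Proof. The fact that $|\tilde{S}| \geq k$ is obvious from the
definition of the algorithm. Let $S^* = \arg\max \rho_{\geq k}(G)$,
$\rho^* = \rho(S^*)$, and $\beta = \frac{2}{3(1+\epsilon)}$. Let $S$ be the first set
generated during the algorithm such that
$\rho(S) \geq \frac{\beta}{2}\rho^* = \frac{\rho^*}{3(1+\epsilon)}$.
Such a set must exist, since $\rho_{\geq k}(G) \leq \rho^*(G)$ and we saw in
Lemma 3 that at least one of the generated sets has density
at least $\frac{\rho^*(G)}{2(1+\epsilon)}$. If $|S| \geq k$, then we are done.

Otherwise, consider the case when $|S| < k$. For any set $S'$
generated before $S$ during the algorithm, we have
$\rho(S') < \frac{\beta}{2}\rho^*$. Then, for any node $i \notin S$, we have
$d_i \leq 2(1 + \epsilon)\frac{\beta}{2}\rho^* = (1 + \epsilon)\beta\rho^*$,
where $d_i$ is the degree of $i$ at the time it got
removed during the algorithm. Thus, none of those nodes
can be in the core $C_{(1+\epsilon)\beta\rho^*}(G)$. Therefore,
$C_{(1+\epsilon)\beta\rho^*}(G) \subseteq S$, and:

$$\begin{aligned}
|E(S)| &\geq |E(C_{(1+\epsilon)\beta\rho^*}(G))| && \because C_{(1+\epsilon)\beta\rho^*}(G) \subseteq S \\
&\geq |E(C_{(1+\epsilon)\beta\rho^*}(S^*))| \\
&\geq (1 - (1 + \epsilon)\beta)|E(S^*)|. && \because \text{[3, Lemma 2]}
\end{aligned} \qquad (4.3)$$

Now, let $\hat{S}$ be the last set generated during the algorithm
such that $|\hat{S}| \geq k$. We will show that $\rho(\hat{S})$ is a
$(3 + 3\epsilon)$-approximation to $\rho^*$. Since we remove at most
$\frac{\epsilon}{1+\epsilon}|\hat{S}|$ nodes, we have
$k > |\hat{S} \setminus \tilde{A}(\hat{S})| \geq \frac{|\hat{S}|}{1+\epsilon}$, i.e.,
$|\hat{S}| < (1 + \epsilon)k$. Also, $S \subseteq \hat{S}$, hence
$|E(\hat{S})| \geq |E(S)|$.

Therefore,

$$\begin{aligned}
\rho(\hat{S}) = \frac{|E(\hat{S})|}{|\hat{S}|}
&\geq \frac{|E(S)|}{(1 + \epsilon)k} && \because S \subseteq \hat{S},\ |\hat{S}| < (1 + \epsilon)k \\
&\geq \frac{1 - (1 + \epsilon)\beta}{1 + \epsilon} \cdot \frac{|E(S^*)|}{k} && \because (4.3) \\
&\geq \frac{1 - (1 + \epsilon)\beta}{1 + \epsilon}\,\rho^* && \because |S^*| \geq k \\
&= \frac{\rho^*}{3(1 + \epsilon)}.
\end{aligned}$$

□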



Although the algorithm above has a worse performance 
guarantee than Algorithm 1, this is only true in the case 
when the densest subgraph has fewer than k nodes. In the 
case that the densest subgraph on G has at least k nodes 
then the above algorithm performs on par with Algorithm 
1. 

Lemma 10. Let $|S^*| > k$, where $S^* = \arg\max \rho_{\geq k}(G)$.
Then, Algorithm 2 achieves a $(2 + 2\epsilon)$-approximation for
$\rho_{\geq k}(G)$.

Proof. If $|S^*| > k$ and $\rho^* = \rho(S^*)$, then one can see
that for any $i \in S^*$, $\deg_{S^*}(i) \geq \rho^*$ (removing $i$ leaves
at least $k$ nodes, so the argument of (4.1) applies). Now,
consider the first set $S$ generated during the algorithm such
that $\tilde{A}(S) \cap S^* \neq \emptyset$. Since the final set generated by
the algorithm has cardinality at most $k$, and $|S^*| > k$, such a
set definitely exists. For the considered set $S$, if
$i \in \tilde{A}(S) \cap S^*$, then since $\tilde{A}(S) \subseteq A(S)$ and by the
definition of $A(S)$, we have

$$2(1 + \epsilon)\rho(S) \geq \deg_S(i) \geq \deg_{S^*}(i) \geq \rho^*,$$

completing the proof. □

Finally, to bound the number of passes, note that once 
the remaining subgraph has fewer than k nodes, we can 
safely terminate the algorithm and return the best set seen 
so far. Together with Lemma 4 this immediately leads to 
the following. 

Lemma 11. Algorithm 2 terminates in $O(\log_{1+\epsilon} \frac{n}{k})$ passes.

4.3 Directed graphs 

In this section we obtain a $(2 + 2\epsilon)$-approximation algorithm
for finding the densest subgraph in directed graphs.
Recall that in directed graphs we are looking for two not
necessarily disjoint subsets $S, T \subseteq V$. We assume that the
ratio $c = |S^*|/|T^*|$ for the optimal sets $S^*, T^*$ is known
to the algorithm. In practice, one can do a search for this
value, by trying the algorithm for different values of $c$ and
retaining the best result.

The algorithm then proceeds in a similar spirit as in the
undirected case. We begin with $S = T = V$ and remove
either those nodes $A(S)$ whose outdegree to $T$ is below
average, or those nodes $B(T)$ whose indegree from $S$ is below
average. (Formally we need the degrees to be below a threshold
slightly above the average for the algorithm to converge.) A
naive way to decide whether the set $A(S)$ or $B(T)$ should
be removed in the current pass is to look at the maximum
outdegree, $|E(i^*, T)|$, of nodes in $A(S)$ and the maximum
indegree, $|E(S, j^*)|$, of nodes in $B(T)$. If
$|E(S, j^*)|/|E(i^*, T)| \geq c$ then $A(S)$ can be removed and
otherwise $B(T)$ can be removed. However, a better way is to
make this choice directly based on the current sizes of $S$ and
$T$. Intuitively, if $|S|/|T| \geq c$, then we should be removing
the nodes from $S$ to get the ratio closer to $c$, otherwise we
should remove those from $T$. In addition to being simpler,
this way is also faster, mainly because it needs to compute
only one of $A(S)$ and $B(T)$ in every pass, leading to a
significant speedup in practice.

Algorithm 3 contains the formal description. 



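Algorithm 3 Densest subgraph for directed graphs.
Require: $G = (V, E)$, $c > 0$, and $\epsilon > 0$
  $S, T, \tilde{S}, \tilde{T} \leftarrow V$
  while $S \neq \emptyset$ and $T \neq \emptyset$ do
    if $|S|/|T| \geq c$ then
      $A(S) \leftarrow \{i \in S \mid |E(i, T)| \leq (1 + \epsilon)\frac{|E(S,T)|}{|S|}\}$
      $S \leftarrow S \setminus A(S)$
    else
      $B(T) \leftarrow \{j \in T \mid |E(S, j)| \leq (1 + \epsilon)\frac{|E(S,T)|}{|T|}\}$
      $T \leftarrow T \setminus B(T)$
    end if
    if $\rho(S, T) > \rho(\tilde{S}, \tilde{T})$ then
      $\tilde{S} \leftarrow S$, $\tilde{T} \leftarrow T$
    end if
  end while
  return $\tilde{S}, \tilde{T}$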
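
The following sketch mirrors Algorithm 3 for a single guess of the ratio $c$. It assumes an in-memory list of directed edges that is re-scanned once per pass; the function name and the returned triple are illustrative choices. In practice the routine would be called for several candidate values of $c$, keeping the best result, as described above.

```python
import math


def densest_directed(vertices, edges, c, eps):
    """Sketch of Algorithm 3; returns (S, T, rho(S, T)) for the best pair seen."""
    S, T = set(vertices), set(vertices)
    best = (set(S), set(T), -1.0)
    while S and T:
        # One pass: outdegrees into T, indegrees from S, and |E(S, T)|.
        out_deg = dict.fromkeys(S, 0)
        in_deg = dict.fromkeys(T, 0)
        m = 0
        for u, v in edges:
            if u in out_deg and v in in_deg:
                out_deg[u] += 1
                in_deg[v] += 1
                m += 1
        rho = m / math.sqrt(len(S) * len(T))
        if rho > best[2]:
            best = (set(S), set(T), rho)
        if len(S) / len(T) >= c:
            limit = (1 + eps) * m / len(S)
            S = {i for i in S if out_deg[i] > limit}
        else:
            limit = (1 + eps) * m / len(T)
            T = {j for j in T if in_deg[j] > limit}
    return best


# Hypothetical toy graph: nodes 0 and 1 each point to nodes 2, 3 and 4.
E = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]
print(densest_directed(range(5), E, c=2 / 3, eps=0.1))
```

On the toy graph the sketch first shrinks $S$ to the two sources, then shrinks $T$ to the three targets, and reports that pair as the densest one seen.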



First, we analyze the approximation factor of the algo- 
rithm. 



Lemma 12. Algorithm 3 leads to a $(2 + 2\epsilon)$-approximation
to the densest subgraph problem on directed graphs.

Proof. As in [10], we generate an assignment of the
edges to the endpoints corresponding to the algorithm.
Whenever $(i, j) \in E$, and $i \in A(S)$ is removed from $S$, we assign
$(i, j)$ to $i$; a similar assignment is made for the nodes in
$B(T)$. Let $\rho = \rho(\tilde{S}, \tilde{T})$. Let $\deg^*_{out}$ be the maximum
outdegree and $\deg^*_{in}$ be the maximum indegree in $G$.

We need to show that if $A(S)$ is removed, then

$$\forall i \in A(S),\quad \sqrt{c}\,|E(i, T)| \leq (1 + \epsilon)\rho(S, T).$$

Suppose that $|S|/|T| \geq c$, and so the nodes in $A(S)$ will be
removed. For all $i \in A(S)$, we have

$$\begin{aligned}
\sqrt{c}\,|E(i, T)| &\leq \sqrt{c} \cdot (1 + \epsilon)\frac{|E(S, T)|}{|S|} \\
&\leq \sqrt{c} \cdot \frac{(1 + \epsilon)|E(S, T)|}{\sqrt{c\,|S|\,|T|}} \\
&= (1 + \epsilon)\frac{|E(S, T)|}{\sqrt{|S|\,|T|}} \\
&= (1 + \epsilon)\rho(S, T).
\end{aligned}$$

The second line follows because $|S| \geq c|T| \Rightarrow |S| \geq \sqrt{c\,|S|\,|T|}$.
Similarly, one can show that if $B(T)$ gets removed, then

$$\forall j \in B(T),\quad \frac{1}{\sqrt{c}}\,|E(S, j)| \leq (1 + \epsilon)\rho(S, T).$$

This proves that in the given assignment (of edges to
endpoints), $\sqrt{c}\,\deg^*_{out} \leq (1 + \epsilon)\rho$ and
$\frac{1}{\sqrt{c}}\deg^*_{in} \leq (1 + \epsilon)\rho$. Once
we have such an assignment, we can use the same logic as
in Lemmas 7 and 8 in [10] to conclude that the algorithm
gives a $(2 + 2\epsilon)$-approximation:

$$\max_{S, T \subseteq V,\ |S|/|T| = c} \{\rho(S, T)\} \leq (2 + 2\epsilon)\rho(\tilde{S}, \tilde{T}).$$ □

Next, we analyze the number of passes of the algorithm. 
The proof is similar to that of Lemma 4. 

Lemma 13. Algorithm 3 terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. We have

$$\begin{aligned}
|E(S, T)| &= \sum_{i \in A(S)} |E(i, T)| + \sum_{i \in S \setminus A(S)} |E(i, T)| \\
&> (1 + \epsilon)(|S| - |A(S)|)\frac{|E(S, T)|}{|S|},
\end{aligned}$$

which yields

$$|S \setminus A(S)| < \frac{1}{1 + \epsilon}|S|.$$

Similarly, we can prove

$$|T \setminus B(T)| < \frac{1}{1 + \epsilon}|T|.$$

Therefore, during each pass of the algorithm, either the size
of the remaining set $S$ or the size of the remaining set $T$ goes
down by a factor of at least $1 + \epsilon$. Hence, in $O(\log_{1+\epsilon} n)$
passes, one of these sets becomes empty and the algorithm
terminates. □

5. PRACTICAL CONSIDERATIONS 

In this section we describe two practical considerations in 
implementing the algorithms. The first (Section 5.1) is a 
heuristic method based on Count-Sketch to cut the memory 
requirements of the algorithm. The second (Section 5.2) is 
a discussion on how to realize the algorithms in the MapRe- 
duce computing model. 

5.1 Heuristic improvements 

We showed in Lemma 7 that any $p$-pass algorithm achieving
a 2-approximation to the densest subgraph problem must
use at least $\Omega(n/p)$ space. However, even this amount of space
can be prohibitively large for very large datasets. To further
reduce the space required by the algorithms we turn to
sketching techniques that probabilistically summarize the
degree distribution of the nodes.

Recall that in order to decide whether to remove a particu- 
lar node, the algorithm only needs to be aware of its degree. 
This is the same as counting the number of edges in the 
stream that share this node as one of their endpoints. This 
exact problem of maintaining the frequencies of items in 
the stream using sublinear space was addressed by Charikar 
et al. [11]. They introduce the Count-Sketch data structure,
which maintains $t$ independent estimates, each as a
table on $b$ buckets. For $i = 1, \ldots, t$, let $h_i : V \to [b]$ and
$g_i : V \to \{\pm 1\}$ be hash functions and for $j = 1, \ldots, b$, let
$c_{i,j}$ be counters initialized to zero. When an edge $(x, y)$
arrives, for each $i = 1, \ldots, t$, we update the counters as follows:
$c_{i,h_i(x)} \leftarrow c_{i,h_i(x)} + g_i(x)$ and
$c_{i,h_i(y)} \leftarrow c_{i,h_i(y)} + g_i(y)$. Finally,
when queried for the final degree of a node $x$, we return
the median among all of the estimates $\{c_{i,h_i(x)} \cdot g_i(x)\}_{i=1}^{t}$.
(We refer the reader to [11] for the full details of the data 
structure; in our work, we merely use it as a black-box.) 
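
The update and query rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the class name `CountSketchDegrees`, the integer node ids, and the linear hash family modulo a Mersenne prime are all assumptions made here for the sake of a self-contained example.

```python
import random
import statistics

P = (1 << 61) - 1  # a Mersenne prime used by the (assumed) linear hash family


class CountSketchDegrees:
    """t tables of b signed counters; estimates node degrees from an edge stream."""

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        # One (a, c) pair per table for the bucket hash h_i and the sign hash g_i.
        self.h = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.g = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.c = [[0] * b for _ in range(t)]

    def _bucket(self, i, x):
        a, c = self.h[i]
        return ((a * x + c) % P) % self.b

    def _sign(self, i, x):
        a, c = self.g[i]
        return 1 if ((a * x + c) % P) & 1 else -1

    def add_edge(self, x, y):
        # Each endpoint adds its sign to one counter in every table.
        for i in range(len(self.c)):
            for node in (x, y):
                self.c[i][self._bucket(i, node)] += self._sign(i, node)

    def degree(self, x):
        # Median over the t tables of (counter value) * (sign of x).
        return statistics.median(
            self.c[i][self._bucket(i, x)] * self._sign(i, x)
            for i in range(len(self.c))
        )
```

The sketch stores t × b counters regardless of the number of nodes; estimates for high-degree nodes are the ones least affected by collisions, which is the property the heuristic relies on.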

Charikar et al. [11] showed that this way of probabilisti- 
cally counting leads to a high precision counter for elements 
with high frequencies. Intuitively, this kind of a guarantee 
makes for a perfect fit with our application. We want to 
have good estimates for the nodes with high degrees (other- 
wise one may be removed prematurely). On the other hand, 
the false positive error of accidentally keeping a low-degree 
node is not as severe; a small number of low degree nodes 
will not have a dramatic impact on the size of the densest 
subgraph. As we show in Section 6 this intuition holds true 
in practice, and we find that the Count-Sketch enabled ver- 
sion of the algorithm, which uses a lot less space, sometimes 
(when lucky!) performs even better than the version using 
exact counting. 

5.2 MapReduce implementation 

All of the algorithms presented in this work depend on 
three basic functions: computing the density of the current 
graph, computing the degree of each individual node, and 
removing nodes with degree less than a specified threshold. 
The algorithm itself is very amenable to parallelism as long 
as these basic functions can exploit parallelism. For illustra- 
tion purposes, we focus on a specific distributed computing 
model that is widely used in practice, namely, the MapRe- 
duce model. We assume familiarity with the MapReduce 
model of computation and refer the reader to [16] for details. 
Finding the best parallel implementation of our algorithm 
is an interesting future research direction. 

Computing the density of the graph is a trivial opera- 
tion, as one needs only to count the total number of edges 
and nodes present. To compute the degree of every node in
parallel, in the map step duplicate each edge (u, v) as two
(key; value) pairs: (u; v) and (v; u). This way the input to
every reduce task will be of the form $(u; v_1, v_2, \ldots, v_d)$, where
$v_1, v_2, \ldots, v_d$ are the neighbors of u in G. The reducer can
then count the number of associated values for each key and
output (u; deg(u)).
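
As a rough illustration of the degree computation just described, the following Python snippet simulates the map, shuffle, and reduce steps in a single process. The function names are hypothetical and no actual MapReduce framework is involved; the dictionary merely stands in for the shuffle phase.

```python
from collections import defaultdict


def map_edge(u, v):
    """Map step: emit each undirected edge once under each endpoint."""
    yield u, v
    yield v, u


def reduce_degree(u, neighbors):
    """Reduce step: the degree of u is the number of values grouped under u."""
    return u, len(neighbors)


def degrees(edges):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for u, v in edges:
        for key, value in map_edge(u, v):
            groups[key].append(value)
    return dict(reduce_degree(u, vs) for u, vs in groups.items())
```

For example, `degrees([(1, 2), (2, 3), (1, 3), (3, 4)])` groups three neighbors under node 3 and one under node 4.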

Finally, the removal of the nodes with degree less than
some threshold t, together with their incident edges, can be
accomplished in two MapReduce passes. In the first map phase,
we mark all of the nodes slated for removal by adding a (v; $)
key-value pair for all nodes v that are being removed. We
map each edge (u, v) to (u; v). The reduce task associated
with u then gets all of the edges whose first endpoint is u,
and the symbol $ if the node was marked. In case the node
is marked, the reduce task returns nothing; otherwise it just
copies its input. In the second MapReduce pass we pivot
on the second node in the edge description. Again, we only
keep the edges incident on unmarked nodes. It is easy to see
that the only edges that survive are exactly those incident
on a pair of unmarked nodes.
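
The two-pass removal can be simulated in the same single-process style. This is a sketch under our own naming (`filter_round`, `remove_nodes`), assuming each edge is stored once as a pair (u, v); it mirrors the marker-based filtering described above rather than any particular Hadoop job.

```python
from collections import defaultdict

MARK = "$"  # the marker symbol attached to nodes slated for removal


def filter_round(edges, marked, pivot):
    """One round: group edges by endpoint `pivot` (0 or 1) and drop every
    group whose key node also received the marker."""
    groups = defaultdict(list)
    for v in marked:                  # map: emit (v; $) for each removed node
        groups[v].append(MARK)
    for edge in edges:                # map: key each edge by the pivot endpoint
        groups[edge[pivot]].append(edge)
    survivors = []
    for values in groups.values():    # reduce
        if MARK in values:
            continue                  # marked node: return nothing
        survivors.extend(values)      # unmarked node: copy the input
    return survivors


def remove_nodes(edges, marked):
    after_first = filter_round(edges, marked, pivot=0)
    return filter_round(after_first, marked, pivot=1)
```

Only edges whose two endpoints are both unmarked pass both rounds, since each round filters on one endpoint.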

6. EXPERIMENTS 

In this section we detail the experiments and their results.
First, we describe the datasets used in our experiments
(Section 6.1). These datasets are large social networks, some
of which are publicly available for download or obtainable
through an API. Next, we study the accuracy of our algorithms
when compared to the optimum. To this end, we obtain the optimum
density value using a linear program, and compare the output
of our algorithm to this optimum (Section 6.2). We
then study the performance of the streaming version of our
algorithm on both undirected and directed graphs. In particular,
we analyze the effect of $\epsilon$ on the accuracy and the
number of passes (Section 6.3 and Section 6.4). Finally, we
remark on the space savings brought about by the sketching
heuristic presented in Section 5.1 (Section 6.5) and on
a proof-of-concept MapReduce implementation to compute
the densest subgraph on an extremely large graph (Section
6.6).

Since we focus just on finding (or approximating) the 
densest subgraph, we do not try to enumerate all of the 
dense subgraphs in the given graph. It is easy to adapt our 
algorithm to iteratively enumerate node-disjoint (approxi- 
mately) densest subgraphs in the graph, with the guarantee 
that at each step of the enumeration, the algorithm will pro- 
duce an approximate solution on the residual graph. The 
quality of the resulting solution reflects more the properties 
of the underlying graph than our algorithm itself and hence 
we do not further explore this direction. 

6.1 Data description 

Almost all of our experiments are based on four large social
networks, namely, flickr, im, livejournal, and twitter.
flickr is the social network corresponding to the
Flickr (flickr.com) photo-sharing website, im is the graph
induced by the contacts in the Yahoo! Messenger service,
livejournal is the graph induced by the friends in the
LiveJournal (livejournal.com) social network, and twitter is
the graph induced by the followers in the social media site
Twitter (twitter.com).

flickr is available publicly and can be obtained by using
an API (www.flickr.com/services/api/). A smaller version
of the im graph can be obtained via the Webscope program
from webscope.sandbox.yahoo.com/catalog.php?datatype=g.
The version of livejournal used in our experiments
can be downloaded from
snap.stanford.edu/data/soc-LiveJournal1.txt.gz and the version
of twitter used in our experiments can be obtained from
an.kaist.ac.kr/~haewoon/release/twitter_social_graph/.
The details of the datasets are provided in Table 1.

G            type        |V|     |E|
flickr       undirected  976K    7.6M
im           undirected  645M    6.1B
livejournal  directed    4.84M   68.9M
twitter      directed    50.7M   2.7B

Table 1: Parameters of the graphs used in the experiments.

Note that when trying to measure the quality of our algorithms,
the following two baselines do not make sense in
the context of the above graphs: (i) computing the actual
densest subgraph, which is infeasible for such large graphs,
and (ii) running the algorithm of [10], which would take
quadratic time (linear time for each pass and a linear number
of passes), which is still infeasible for these graphs. In order
to circumvent this, we work with slightly smaller graphs
just to compare the quality of the solution to that of the
optimum (Section 6.2).

6.2 Quality of approximation 

We study how good an approximation is obtained by
our algorithm in the undirected case. To enable this, we
need to compute the value of the optimum. Recall that,
as mentioned in Section 1, both the directed and undirected
densest subgraph problems can be solved exactly using parametric
flow. In this section we want to obtain $\rho^*$, i.e., the
value of the optimal solution, to argue that the approximation
factor in practice is much better than the $2(1 + \epsilon)$
guaranteed by Lemma 3. (Such a test is very expensive for
directed graphs because one has to try all $n^2$ values of c.)

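In order to solve the densest subgraph problem exactly,
we use the following linear programming (LP) formulation.

\[
\begin{aligned}
\max \quad & \sum_{(i,j) \in E} x_{ij} \\
\text{s.t.} \quad & x_{ij} \le y_i, && \forall (i,j) \in E, \\
& x_{ij} \le y_j, && \forall (i,j) \in E, \\
& \sum_i y_i \le 1, \\
& x_{ij},\; y_i \ge 0.
\end{aligned}
\]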

Charikar [10] showed that the value of this LP is precisely
equal to $\rho^*(G)$. We use this observation to measure the quality
of approximation obtained by Algorithm 1. To solve the
LP, we use the COIN-OR CLP solver (projects.coin-or.org/Clp).
We use seven moderately-sized undirected graphs
publicly available at SNAP (snap.stanford.edu). Table 2
shows the parameters of these graphs and the approximation
factor of our algorithms for different settings of $\epsilon$. It is
clear that the approximation factors obtained by our algorithm
are much better than what Lemma 3 promises. Furthermore,
even high values of $\epsilon$ seem to hardly hurt the
approximation guarantees.
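
The comparison behind this measurement can be reproduced on a toy graph. The sketch below is not the LP-based pipeline described above: it replaces the LP with brute-force enumeration (feasible only for a handful of nodes), and it assumes that Algorithm 1, which is not reproduced in this section, removes in each pass every node whose degree is at most $2(1+\epsilon)$ times the current density. All function names are hypothetical.

```python
from itertools import combinations


def density(nodes, edges):
    """rho(S) = |E(S)| / |S| for the subgraph induced by `nodes`."""
    inside = [e for e in edges if e[0] in nodes and e[1] in nodes]
    return len(inside) / len(nodes)


def brute_force_optimum(nodes, edges):
    """Exact rho* by enumerating every non-empty subset (tiny graphs only)."""
    nodes = list(nodes)
    return max(
        density(set(sub), edges)
        for k in range(1, len(nodes) + 1)
        for sub in combinations(nodes, k)
    )


def peel(nodes, edges, eps):
    """Assumed peeling rule: per pass, drop all nodes of degree at most
    2 * (1 + eps) * rho(S), and remember the densest S seen so far."""
    current = set(nodes)
    best = 0.0
    while current:
        rho = density(current, edges)
        best = max(best, rho)
        deg = {v: 0 for v in current}
        for u, v in edges:
            if u in current and v in current:
                deg[u] += 1
                deg[v] += 1
        current = {v for v in current if deg[v] > 2 * (1 + eps) * rho}
    return best
```

On a 4-clique with a two-edge tail attached, for instance, the ratio `brute_force_optimum(...) / peel(...)` plays the role of the $\rho^*(G)/\rho(G)$ columns of Table 2.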

6.3 Undirected graphs 

In this section we study the performance of our algorithms
on two undirected graphs, namely, flickr and im. First, we
study the effect of $\epsilon$ on the approximation factor and the
number of passes. Figure 6.1 shows the results. For ease
of comparison, we show the values relative to the density
obtained by our algorithm for $\epsilon = 0$. (Note that the setting
$\epsilon = 0$ is similar to Charikar's algorithm [10] in terms of the
approximation factor but can run in far fewer passes;
however, termination is not guaranteed for $\epsilon = 0$.)
As we saw in Table 2, the approximation does not deteriorate
for higher values of $\epsilon$ (note that the performance is not
monotone in $\epsilon$). Choosing a value of $\epsilon \in [0.5, 1]$ seems to cut
down the number of passes by half while losing only 10% of
the optimum.

Figure 6.1: Effect of $\epsilon$ on the approximation and the
number of passes (panels: $\epsilon$ vs. approximation; $\epsilon$ vs. number of passes).

We then move on to analyze the graph structure as the 
passes progress. Figure 6.2 shows the relative density as a 
function of the number of passes. (Curiously, we observe 
a unimodal behavior for flickr, but this does not seem to 
hold in general.) 

Figure 6.3 shows the number of nodes and edges in the 
graph after each pass. The shape of the plots suggests that 
the graph gets dramatically smaller even in the early passes. 
This is a very useful feature in practice, since if the graph 
gets very small early on, then the rest of the computation 
can be done in the main memory. This will avoid the over- 
head of additional passes. 

Note also that the worst-case bound of $O(\log_{1+\epsilon} n)$ for
the number of passes as given by Lemma 4 is never achieved 
by these graphs. This is possibly because of the heavy- 
tail nature of the degree distribution of graphs derived from 
social networks and their core connectivity properties; see 
[27, 30]. These properties may also contribute to achieving 
the good approximation ratio, i.e., the worst-case bound of 
Lemma 3 is not met by these graphs. Exploring these in 
further detail is outside the scope of this work and is an 
interesting area of future research. 

6.4 Directed graphs 

In this section we study the performance of the directed
graph version of our algorithm. We use the livejournal
and twitter graphs for this purpose. Recall that for directed
graphs, we have to try various values of c (Section
4.3). Of course, trying all $n^2$ possible values of c is prohibitive.
A simple alternative is to choose a resolution $\delta > 1$
and try c at different powers of $\delta$. (One can prove that this
worsens the approximation guarantee by at most a factor $\delta$
[10].) Clearly, the number of values of c to try, and hence the
running time, is proportional to $2 \log n / \log \delta$.
First, we study the effect of the choice of $\delta$ compared to the
choice of $\epsilon$. Table 3 shows the results. From the values, it is
easy to see that as long as $\delta$ remains reasonable, the effect
of $\epsilon$ is as in the undirected case. To make the rest of the
study less cumbersome, we fix $\delta = 2$ for the remainder of
this section. First, we present the results for livejournal.

                                            $\rho^*(G)/\rho(G)$
G = (V, E)    |V|      |E|      $\rho^*(G)$   $\epsilon = 0.001$   $\epsilon = 0.1$   $\epsilon = 1$
AS20000102    6,474    13,233   9.29      1.229        1.268      1.194
ca-AstroPh    18,772   396,160  32.12     1.147        1.156      1.273
ca-CondMat    23,133   186,936  13.47     1.072        1.072      1.429
ca-GrQc       5,242    28,980   22.39     1.000        1.000      1.395
ca-HepPh      12,008   237,010  119.00    1.000        1.017      1.151
ca-HepTh      9,877    51,971   15.50     1.000        1.000      1.356
email-Enron   36,692   367,662  37.34     1.058        1.072      1.063

Table 2: Empirical approximation bounds for various values of $\epsilon$.

Figure 6.3: Number of nodes and edges in the graph after each step of the pass for flickr and im.



$\epsilon$ \ $\delta$    2        10       100
0        325.27   312.13   307.96
1        334.38   308.70   306.91
2        294.50   284.47   179.59

Table 3: livejournal: $\rho$ for different $\delta$ and $\epsilon$.

We study the performance of the algorithm for various
choices of c, given $\delta = 2$. In particular, we measure the
density and the number of passes. Figure 6.4 shows the
values. The behavior of density is quite complex, and for
livejournal, the optimum occurs when the relative sizes
of S and T are not skewed.

Finally, Figure 6.5 shows the behavior of livejournal
for the best setting of c (which is 0.436) for $\delta = 2$,
$\epsilon = 1$. It clearly shows the alternating nature of the simplified
algorithm (Algorithm 3) that we developed in Section 4.3.
As always, the number of nodes and edges falls dramatically
as the passes progress.

For twitter, we used $\epsilon = 1$, $\delta = 2$ and studied the
performance of the algorithm for various values of c. Figure
6.6 shows the density and the number of passes for various
values of c. Unlike livejournal, the best value of c is
not concentrated around 1. This may be due to the highly
skewed nature of the twitter graph: for example, there
are about 600 popular users who are followed by more than
30 million other users. The results from livejournal and
twitter suggest that, in practice, one can safely skip many
values of c.
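
The resolution-based search over c used throughout this section can be sketched as a small helper that lists the candidate values. The function name `candidate_ratios` is hypothetical, and the range [1/n, n] is our assumption, based on S and T each containing between 1 and n nodes.

```python
def candidate_ratios(n, delta):
    """Powers of delta covering [1/n, n], the assumed range of c = |S|/|T|."""
    if delta <= 1:
        raise ValueError("delta must exceed 1")
    k = 0
    while delta ** k < n:  # smallest k with delta**k >= n
        k += 1
    return [delta ** j for j in range(-k, k + 1)]
```

With n = 8 and $\delta = 2$ this yields the seven values 1/8, 1/4, ..., 8, i.e., roughly $2 \log n / \log \delta$ candidates; each one would then be passed to a separate run of the directed algorithm.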

6.5 Performance of sketching 

In this section we discuss the performance of the sketching
heuristic presented in Section 5.1. We tested the algorithm
on flickr, which has 976K nodes. Recall that the number
of words in a Count-Sketch scheme using b buckets and t
independent hash tables is $t \times b$. In Table 4, we show the
ratio of the density of the subgraph found with sketching to
that found without sketching, for various values of b and $\epsilon$.
The bottom row shows the main memory used by the algorithm
with sketching as a fraction of that used by
the algorithm without sketching. Clearly, for small values
of $\epsilon$, the performance difference is not very significant even
for b = 30000, which means only about $5 \times 30000 / 976\mathrm{K}$ = 16% of
main memory is used. This suggests that, despite the space
lower bounds (Lemma 7), in practice, a sketching scheme
can obtain significant savings in main memory.

Figure 6.2: Density as a function of the number of
passes for various values of $\epsilon$, for flickr and im
(panel shown: flickr, $\rho$ vs. passes).



$\epsilon$        b = 30000   b = 40000   b = 50000
0        1.047       1.027       1.014
0.5      0.960       0.896       0.921
1        0.958       0.936       0.918
1.5      0.890       0.911       0.929
2        0.760       0.845       0.869
2.5      0.787       0.708       0.740
Memory   0.16        0.20        0.25

Table 4: Ratio of $\rho$ with and without sketching for
flickr (t = 5).

6.6 MapReduce implementation 

In this section we study the performance of the MapReduce
implementation of our algorithms for both directed and
undirected graphs. For this purpose, we use the im and
twitter graphs since they are too big to be studied under
the semi-streaming model. We implemented our algorithms
in Hadoop (hadoop.apache.org) and ran them with 2000 mappers
and 2000 reducers. Figure 6.7 shows the wall-clock
running times for each pass for im, which is an undirected
graph. It takes under 260 minutes for our algorithm
to run on im (a massive graph with more than half a billion
nodes). For twitter, which is a directed graph, our algorithm
takes around 35 minutes for a given value of c and for
each iteration; Figure 6.6 shows that the number of iterations
is between four and seven, and the number of values of
c to be tried is very small. These results clearly show the
scalability of our algorithms.

Figure 6.4: Density and the number of passes at
$\delta = 2$ for livejournal.

7. CONCLUSIONS 

In this paper we studied the problem of finding dense subgraphs,
a fundamental primitive in several data management
applications, in streaming and MapReduce, two computational
models that are increasingly being adopted by large-scale
data processing applications. We showed a simple algorithm
that makes a small number of passes over the graph
and obtains a $(2 + \epsilon)$-approximation to the densest subgraph.
We then obtained several extensions of this algorithm: for
the case when the subgraph is prescribed to be more
than a certain size and when the graph is directed. To the
best of our knowledge, these are the first algorithms for the
densest subgraph problem that truly scale yet offer provable
guarantees. Our experiments showed that the algorithms are
indeed scalable and achieve quality and performance that are
often much better than the theoretical guarantees. Our
algorithm's scalability is the main reason it was possible to
run it on a graph with more than half a billion nodes and
six billion edges.

Figure 6.5: Behavior of $|S|$, $|T|$, $|E(S, T)|$ for the best
parameters of c, $\epsilon$ for livejournal.

Figure 6.7: Time taken on the im graph in MapReduce.